library(lavaan)
library(dplyr)
library(reticulate)
library(marginaleffects)
library(modelsummary)
library(ggplot2)
library(tidyr)
library(egg)
library(lme4)
library(semPlot)
library(tinytable)
library(kableExtra)
library(reshape2)
reticulate::py_run_string("import pymc as pm")
options(rstudio.python.installationPath = "/Users/nathanielforde/mambaforge/envs")
options("modelsummary_factory_default" = "tinytable")
options(repr.plot.width=15, repr.plot.height=8)
knitr::knit_engines$set(python = reticulate::eng_python)
options(scipen=999)
set.seed(130)
Choices are made as we step down the garden of forking paths in the course of any analysis of data. We risk self-delusion and foolish missteps if we cling too strongly to an arbitrary path, and we risk impoverished and inadequate theory if we refuse to explore many paths. In this post we will elaborate on the choices involved in analysing psychometric surveys. Psychometric phenomena are inherently complex and imprecisely measured. Careful attention is needed to test the assumptions used to analyse this species of data, for these models highlight a key aspect of the design choices made in statistical inference. There is a creativity in model development and a requirement for rigour when we structure the inferences derivable from our model. We show how these confirmatory and exploratory modes of inference play out in Confirmatory Factor Analysis and Structural Equation models fit with PyMC under the modern Bayesian workflow.
Measurement and Measurement Constructs
Science is easier when there is a recipe. When there is some procedure to adopt or routine to ape, you can out-source the responsibility for methodological justification. One egregious pattern in this vein attempts to mask implausible nonsense with the much-vaunted mathematical gloss of “statistical significance”. Seen from 1000 feet, this is disappointing but not surprising. Lip-service is paid to the idea of scientific method while we absolve ourselves of the requirement for principled justification and substantive argument.
Evidence should be marshaled in service to argument, and it’s an absurd pretense to claim that data speaks for itself in this argument. Good and compelling argumentation is at the heart of any sound inference. It is a necessary obligation if you expect anyone to make any decision on the strength of your evidence. Procedure and routine tests offer only a facsimile of sound argument and p-values are a poor substitute for logical consequence.
Data is found, gathered, or maybe even carefully curated. In all cases there is a need for a defensive posture: an argument that the data is fit for purpose. Nowhere is this more clear than in psychometrics, where the data is often derived from a strategically constructed survey aimed at a particular target phenomenon: some intuited, but not yet measured, concept that arguably plays a role in human action, motivation or sentiment. The “fuzziness” of the subject matter in psychometrics has had a catalyzing effect on the methodological rigour sought in the science. Survey designs are agonised over for the correct tone and rhythm of sentence structure. Analysis steps are justified and tested under a wealth of modelling routines. Model architectures are defined and refined to better express the hypothesized structures in the data-generating process.
We will examine a smattering of the choices available in the analysis of psychometric survey data. We will first step through these analysis routines using lmer and lavaan before switching to PyMC to demonstrate how to fit Confirmatory Factor Analysis models and Structural Equation Models in a Bayesian fashion using a probabilistic programming language. Throughout, we’ll draw on Levy and Mislevy’s Bayesian Psychometric Modeling.
The Data
The data is borrowed from work here demonstrating SEM modelling with Lavaan. We’ll load it up and begin some exploratory work.
df = read.csv('sem_data.csv')
df$ls_sum <- df$ls_p1 + df$ls_p2 + df$ls_p3
df$ls_mean <- rowMeans(df[ c('ls_p1', 'ls_p2', 'ls_p3')])
head(df) |> kable() |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")

| ID | region | gender | age | se_acad_p1 | se_acad_p2 | se_acad_p3 | se_social_p1 | se_social_p2 | se_social_p3 | sup_friends_p1 | sup_friends_p2 | sup_friends_p3 | sup_parents_p1 | sup_parents_p2 | sup_parents_p3 | ls_p1 | ls_p2 | ls_p3 | ls_sum | ls_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | west | female | 13 | 4.857143 | 5.571429 | 4.500000 | 5.80 | 5.500000 | 5.40 | 6.5 | 6.5 | 7.0 | 7.0 | 7.0 | 6.0 | 5.333333 | 6.75 | 5.50 | 17.58333 | 5.861111 |
| 2 | west | male | 14 | 4.571429 | 4.285714 | 4.666667 | 5.00 | 5.500000 | 4.80 | 4.5 | 4.5 | 5.5 | 5.0 | 6.0 | 4.5 | 4.333333 | 5.00 | 4.50 | 13.83333 | 4.611111 |
| 10 | west | female | 14 | 4.142857 | 6.142857 | 5.333333 | 5.20 | 4.666667 | 6.00 | 4.0 | 4.5 | 3.5 | 7.0 | 7.0 | 6.5 | 6.333333 | 5.50 | 4.00 | 15.83333 | 5.277778 |
| 11 | west | female | 14 | 5.000000 | 5.428571 | 4.833333 | 6.40 | 5.833333 | 6.40 | 7.0 | 7.0 | 7.0 | 7.0 | 7.0 | 7.0 | 4.333333 | 6.50 | 6.25 | 17.08333 | 5.694444 |
| 12 | west | female | 14 | 5.166667 | 5.600000 | 4.800000 | 5.25 | 5.400000 | 5.25 | 7.0 | 7.0 | 7.0 | 6.5 | 6.5 | 7.0 | 5.666667 | 6.00 | 5.75 | 17.41667 | 5.805556 |
| 14 | west | male | 14 | 4.857143 | 4.857143 | 4.166667 | 5.20 | 5.000000 | 4.20 | 5.5 | 6.5 | 7.0 | 6.5 | 6.5 | 6.5 | 5.000000 | 5.50 | 5.50 | 16.00000 | 5.333333 |
The hypothetical dependency structure in this life-satisfaction data set posits a moderation relationship between scores related to life-satisfaction, parental and family support, and self-efficacy. It is not a trivial task to design a survey that can elicit answers plausibly mapped to each of these “factors” or themes, never mind finding a model of their relationship that can inform us as to the relative impact of each on life-satisfaction outcomes.
We can pull out the high level summary statistics to better observe the level of information in our metrics. These metrics are thematically clustered because the source survey deliberately targeted each of the underlying themes.
#| code-fold: true
#| code-summary: "Show the code"
datasummary_skim(df)|>
style_tt(
i = 15:17,
j = 1:1,
background = "#20AACC",
color = "white",
italic = TRUE) |>
style_tt(
i = 18:19,
j = 1:1,
background = "#2888A0",
color = "white",
italic = TRUE) |>
style_tt(
i = 2:14,
j = 1:1,
background = "#17C2AD",
color = "white",
italic = TRUE)

|  | Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram |
|---|---|---|---|---|---|---|---|---|
| ID | 283 | 0 | 187.9 | 106.3 | 1.0 | 201.0 | 367.0 | |
| age | 5 | 0 | 14.7 | 0.8 | 13.0 | 15.0 | 17.0 | |
| se_acad_p1 | 32 | 0 | 5.2 | 0.8 | 3.1 | 5.1 | 7.0 | |
| se_acad_p2 | 36 | 0 | 5.3 | 0.7 | 3.1 | 5.4 | 7.0 | |
| se_acad_p3 | 29 | 0 | 5.2 | 0.8 | 2.8 | 5.2 | 7.0 | |
| se_social_p1 | 24 | 0 | 5.3 | 0.8 | 1.8 | 5.4 | 7.0 | |
| se_social_p2 | 27 | 0 | 5.5 | 0.7 | 2.7 | 5.5 | 7.0 | |
| se_social_p3 | 31 | 0 | 5.4 | 0.8 | 3.0 | 5.5 | 7.0 | |
| sup_friends_p1 | 13 | 0 | 5.8 | 1.1 | 1.0 | 6.0 | 7.0 | |
| sup_friends_p2 | 10 | 0 | 6.0 | 0.9 | 2.5 | 6.0 | 7.0 | |
| sup_friends_p3 | 13 | 0 | 6.0 | 1.1 | 1.0 | 6.0 | 7.0 | |
| sup_parents_p1 | 11 | 0 | 6.0 | 1.1 | 1.0 | 6.0 | 7.0 | |
| sup_parents_p2 | 11 | 0 | 5.9 | 1.1 | 2.0 | 6.0 | 7.0 | |
| sup_parents_p3 | 13 | 0 | 5.7 | 1.1 | 1.0 | 6.0 | 7.0 | |
| ls_p1 | 15 | 0 | 5.2 | 0.9 | 2.0 | 5.3 | 7.0 | |
| ls_p2 | 21 | 0 | 5.8 | 0.7 | 2.5 | 5.8 | 7.0 | |
| ls_p3 | 22 | 0 | 5.2 | 0.9 | 2.0 | 5.2 | 7.0 | |
| ls_sum | 98 | 0 | 16.2 | 2.1 | 8.7 | 16.4 | 20.8 | |
| ls_mean | 97 | 0 | 5.4 | 0.7 | 2.9 | 5.5 | 6.9 | |
|  |  | N | % |  |  |  |  |  |
| region | east | 142 | 50.2 |  |  |  |  |  |
|  | west | 141 | 49.8 |  |  |  |  |  |
| gender | female | 132 | 46.6 |  |  |  |  |  |
|  | male | 151 | 53.4 |  |  |  |  |  |
Note how we’ve distinguished between the “outcome” metrics and the “driver” metrics. Such a distinction may seem trivial, but it is only possible because we bring substantive knowledge to bear on the problem in the design of our survey and the postulation of the theoretical construct. Our data is a multivariate outcome with a large range of possible interacting effects, and the patterns of realisation for our life-satisfaction scores may be quite different from the hypothesised structure. It is this open question that we’re aiming to address in the analysis of psychometric surveys.
Sample Covariances and Correlations
drivers = c('se_acad_p1', 'se_acad_p2', 'se_acad_p3', 'se_social_p1', 'se_social_p2', 'se_social_p3', 'sup_friends_p1','sup_friends_p2', 'sup_friends_p3', 'sup_parents_p1' , 'sup_parents_p2' , 'sup_parents_p3', 'ls_p1', 'ls_p2', 'ls_p3')
plot_heatmap <- function(df, title="Sample Covariances", subtitle="Observed Measures") {
heat_df = df |> as.matrix() |> melt()
colnames(heat_df) <- c("x", "y", "value")
g <- heat_df |> mutate(value = round(value, 2)) |> ggplot(aes(x = x, y = y, fill = value)) +
geom_tile() + geom_text(aes(label = value), color = "black", size = 4) +
scale_fill_gradient2(
high = 'dodgerblue4',
mid = 'white',
low = 'firebrick2'
) + theme(axis.text.x = element_text(angle=45)) + ggtitle(title, subtitle)
g
}
g1 = plot_heatmap(cov(df[, drivers]))
g2 = plot_heatmap(cor(df[, drivers]), "Sample Correlations")
plot <- ggarrange(g1,g2, ncol=1, nrow=2)

Note the relatively strong correlations between the measures of parental support and the life-satisfaction outcome ls_p3. Note, too, how the social self-efficacy scores seem similarly correlated with the secondary life-satisfaction indicator ls_p2. These observed correlations merit further investigation. We can plot the pairs of scatter plots.
df <- df |>
mutate(id = row_number())
# Prepare data to be plotted on the x axis
x_vars <- pivot_longer(data = df,
cols = se_acad_p1:ls_p3,
names_to = "variable_x",
values_to = "x")
# Prepare data to be plotted on the y axis
y_vars <- pivot_longer(data = df,
cols = se_acad_p1:ls_p3,
names_to = "variable_y",
values_to = "y")
# Join data for x and y axes and make plot
full_join(x_vars, y_vars,
by = c("id"),
relationship = "many-to-many") |>
ggplot() +
aes(x = x, y = y) +
geom_point() + geom_smooth(method='lm') +
facet_grid(c("variable_x", "variable_y")) + ggtitle("Pair Plot of Indicator Metrics",
"Comparing Against Life Satisfaction Scores")
The scatter plots among the highly correlated variables in the heatmap do seem to exhibit some kind of linear relationship with aspects of the life-satisfaction scores. We now turn to modelling these relationships to tease out some kind of inferential summary of that relationship.
Fit Initial Regression Models
To model these relationships we make use of the aggregated sum and mean scores, modelling the relationship between the indicator metrics and life-satisfaction. We fit a variety of models, each escalating the number of indicator metrics we incorporate into our model of the life-satisfaction outcome. This side-steps the multivariate nature of the hypothesised constructs and crudely amalgamates the indicator metrics. This may be more or less justified depending on how similar in theme the three outcome questions ls_p1, ls_p2, ls_p3 are.
formula_sum_1st = " ls_sum ~ se_acad_p1 + se_social_p1 + sup_friends_p1 + sup_parents_p1"
formula_mean_1st = " ls_mean ~ se_acad_p1 + se_social_p1 + sup_friends_p1 + sup_parents_p1"
formula_sum_12 = " ls_sum ~ se_acad_p1 + se_acad_p2 + se_social_p1 + se_social_p2 +
sup_friends_p1 + sup_friends_p2 + sup_parents_p1 + sup_parents_p2"
formula_mean_12 = " ls_mean ~ se_acad_p1 + se_acad_p2 + se_social_p1 + se_social_p2 +
sup_friends_p1 + sup_friends_p2 + sup_parents_p1 + sup_parents_p2"
formula_sum = " ls_sum ~ se_acad_p1 + se_acad_p2 + se_acad_p3 + se_social_p1 + se_social_p2 + se_social_p3 + sup_friends_p1 + sup_friends_p2 + sup_friends_p3 + sup_parents_p1 + sup_parents_p2 + sup_parents_p3"
formula_mean = " ls_mean ~ se_acad_p1 + se_acad_p2 + se_acad_p3 + se_social_p1 + se_social_p2 + se_social_p3 + sup_friends_p1 + sup_friends_p2 + sup_friends_p3 + sup_parents_p1 + sup_parents_p2 + sup_parents_p3"
mod_sum = lm(formula_sum, df)
mod_sum_1st = lm(formula_sum_1st, df)
mod_sum_12 = lm(formula_sum_12, df)
mod_mean = lm(formula_mean, df)
mod_mean_1st = lm(formula_mean_1st, df)
mod_mean_12 = lm(formula_mean_12, df)
min_max_norm <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
df_norm <- as.data.frame(lapply(df[c(5:19)], min_max_norm))
df_norm$ls_sum <- df$ls_sum
df_norm$ls_mean <- df$ls_mean
mod_sum_norm = lm(formula_sum, df_norm)
mod_mean_norm = lm(formula_mean, df_norm)
models = list(
"Outcome: sum_score" = list("model_sum_1st_factors" = mod_sum_1st,
"model_sum_1st_2nd_factors" = mod_sum_12,
"model_sum_score" = mod_sum,
"model_sum_score_norm" = mod_sum_norm
),
"Outcome: mean_score" = list(
"model_mean_1st_factors" = mod_mean_1st,
"model_mean_1st_2nd_factors" = mod_mean_12,
"model_mean_score"= mod_mean,
"model_mean_score_norm" = mod_mean_norm
)
)

The classical presentation of regression models reports the coefficient weights accorded to each of the input variables. We present these models to highlight that the manner in which we represent our theoretical constructs has ramifications for the interpretation of the data generating process. In particular, note how different degrees of significance are accorded to the different variables depending on which variables are included.
#| code-fold: true
#| code-summary: "Show the code"
modelsummary(models, stars=TRUE, shape ="cbind") |>
style_tt(
i = 2:25,
j = 1:1,
background = "#17C2AD",
color = "white",
italic = TRUE)

|  | Outcome: sum_score |  |  |  | Outcome: mean_score |  |  |  |
|---|---|---|---|---|---|---|---|---|
|  | model_sum_1st_factors | model_sum_1st_2nd_factors | model_sum_score | model_sum_score_norm | model_mean_1st_factors | model_mean_1st_2nd_factors | model_mean_score | model_mean_score_norm |
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||||||||
| (Intercept) | 5.118*** | 2.644** | 2.094* | 7.783*** | 1.706*** | 0.881** | 0.698* | 2.594*** |
| (0.907) | (0.985) | (0.954) | (0.658) | (0.302) | (0.328) | (0.318) | (0.219) | |
| se_acad_p1 | 0.242 | -0.034 | -0.208 | -0.800 | 0.081 | -0.011 | -0.069 | -0.267 |
| (0.147) | (0.180) | (0.192) | (0.742) | (0.049) | (0.060) | (0.064) | (0.247) | |
| se_social_p1 | 1.088*** | 0.501* | 0.355+ | 1.846+ | 0.363*** | 0.167* | 0.118+ | 0.615+ |
| (0.162) | (0.204) | (0.200) | (1.039) | (0.054) | (0.068) | (0.067) | (0.346) | |
| sup_friends_p1 | 0.125 | -0.224+ | -0.272+ | -1.630+ | 0.042 | -0.075+ | -0.091+ | -0.543+ |
| (0.088) | (0.133) | (0.150) | (0.901) | (0.029) | (0.044) | (0.050) | (0.300) | |
| sup_parents_p1 | 0.561*** | 0.238+ | 0.072 | 0.432 | 0.187*** | 0.079+ | 0.024 | 0.144 |
| (0.100) | (0.141) | (0.143) | (0.858) | (0.033) | (0.047) | (0.048) | (0.286) | |
| se_acad_p2 | 0.448* | 0.327 | 1.261 | 0.149* | 0.109 | 0.420 | ||
| (0.197) | (0.202) | (0.779) | (0.066) | (0.067) | (0.260) | |||
| se_social_p2 | 0.756*** | 0.509* | 2.206* | 0.252*** | 0.170* | 0.735* | ||
| (0.213) | (0.219) | (0.949) | (0.071) | (0.073) | (0.316) | |||
| sup_friends_p2 | 0.369* | 0.331* | 1.490* | 0.123* | 0.110* | 0.497* | ||
| (0.157) | (0.160) | (0.720) | (0.052) | (0.053) | (0.240) | |||
| sup_parents_p2 | 0.370** | 0.118 | 0.591 | 0.123** | 0.039 | 0.197 | ||
| (0.138) | (0.144) | (0.719) | (0.046) | (0.048) | (0.240) | |||
| se_acad_p3 | 0.153 | 0.637 | 0.051 | 0.212 | ||||
| (0.174) | (0.726) | (0.058) | (0.242) | |||||
| se_social_p3 | 0.443** | 1.771** | 0.148** | 0.590** | ||||
| (0.161) | (0.642) | (0.054) | (0.214) | |||||
| sup_friends_p3 | 0.165 | 0.989 | 0.055 | 0.330 | ||||
| (0.130) | (0.782) | (0.043) | (0.261) | |||||
| sup_parents_p3 | 0.526*** | 3.158*** | 0.175*** | 1.053*** | ||||
| (0.126) | (0.754) | (0.042) | (0.251) | |||||
| Num.Obs. | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 |
| R2 | 0.399 | 0.467 | 0.517 | 0.517 | 0.399 | 0.467 | 0.517 | 0.517 |
| R2 Adj. | 0.390 | 0.451 | 0.496 | 0.496 | 0.390 | 0.451 | 0.496 | 0.496 |
| AIC | 1090.9 | 1064.7 | 1044.7 | 1044.7 | 469.1 | 442.9 | 422.9 | 422.9 |
| BIC | 1112.8 | 1101.2 | 1095.7 | 1095.7 | 491.0 | 479.4 | 473.9 | 473.9 |
| Log.Lik. | -539.455 | -522.373 | -508.341 | -508.341 | -228.548 | -211.466 | -197.434 | -197.434 |
| RMSE | 1.63 | 1.53 | 1.46 | 1.46 | 0.54 | 0.51 | 0.49 | 0.49 |
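Notice also that each sum-score model and its mean-score twin report identical R² and adjusted R², even though their AIC and RMSE differ: ls_mean is simply ls_sum divided by the three items, and scale-free fit statistics are invariant under a linear rescaling of the outcome. A minimal sketch of that invariance in Python (synthetic data with arbitrary coefficients, not the survey data):

```python
import numpy as np

# ls_mean is just ls_sum / 3, so R2 is identical for the two outcomes.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y_sum = X @ np.array([0.5, 1.0, -0.3]) + rng.normal(size=200)
y_mean = y_sum / 3  # mean of three items == sum / 3

def r_squared(X, y):
    # OLS with an intercept; R2 = 1 - RSS / TSS
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

print(np.isclose(r_squared(X, y_sum), r_squared(X, y_mean)))  # True
```

Both the residual and total sums of squares scale by the same factor (1/9), so their ratio is unchanged; AIC and RMSE, by contrast, depend on the outcome's scale.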
We can similarly plot the coefficient values and their uncertainty, highlighting how the representation or scaling of the variables impacts the scale of the coefficient weights and therefore the surety of any subsequent claims.
models = list(
"model_sum_1st_factors" = mod_sum_1st,
"model_sum_1st_2nd_factors" = mod_sum_12,
"model_sum_score" = mod_sum,
"model_sum_score_norm" = mod_sum_norm,
"model_mean_1st_factors" = mod_mean_1st,
"model_mean_1st_2nd_factors" = mod_mean_12,
"model_mean_score"= mod_mean,
"model_mean_score_norm" = mod_mean_norm
)
modelplot(models, coef_omit = 'Intercept') + geom_vline(xintercept = 0, linetype="dotted",
color = "black") + ggtitle("Comparing Model Parameter Estimates", "Across Covariates")

Significant Coefficients?
An alternative lens on these figures highlights the statistical significance of the coefficients. But again, these criteria are much abused. Significance at what level? Conditional on which representation?
g1 = modelplot(mod_sum, coef_omit = 'Intercept') + aes(color = ifelse(p.value < 0.001, "Significant at 0.001", "Not significant at 0.001")) +
scale_color_manual(values = c("grey", "blue")) + ggtitle("Significance of Coefficient Values", "At Different Levels")
g2 = modelplot(mod_mean, coef_omit = 'Intercept') + aes(color = ifelse(p.value < 0.05, "Significant at 0.05", "Not significant at 0.05")) +
scale_color_manual(values = c("grey", "black"))
plot <- ggarrange(g1,g2, ncol=2, nrow=1)

Aggregate Driver Scores
Perhaps if we play with the feature representation we can increase the proportion of significant indicators. Can we then tell a more definitive story about how parental support and social self-efficacy are determinants of self-reported life-satisfaction scores? Let’s focus here on the sum-score representation and add interaction effects.
df$se_acad_mean <- rowMeans(df[c('se_acad_p1', 'se_acad_p2', 'se_acad_p3')])
df$se_social_mean <- rowMeans(df[c('se_social_p1', 'se_social_p2', 'se_social_p3')])
df$sup_friends_mean <- rowMeans(df[c('sup_friends_p1', 'sup_friends_p2', 'sup_friends_p3')])
df$sup_parents_mean <- rowMeans(df[c('sup_parents_p1', 'sup_parents_p2', 'sup_parents_p3')])
formula_parcel_sum = "ls_sum ~ se_acad_mean + se_social_mean +
sup_friends_mean + sup_parents_mean " #sup_parents_mean*se_social_mean"
formula_parcel_sum_inter = "ls_sum ~ se_acad_mean + se_social_mean +
sup_friends_mean + sup_parents_mean + sup_parents_mean*se_social_mean"
mod_sum_parcel = lm(formula_parcel_sum, df)
mod_sum_inter_parcel = lm(formula_parcel_sum_inter, df)
models_parcel = list("model_sum_score" = mod_sum_parcel,
"model_sum_inter_score"= mod_sum_inter_parcel
)
modelsummary(models_parcel, stars=TRUE)

|  | model_sum_score | model_sum_inter_score |
|---|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||
| (Intercept) | 2.728** | -6.103+ |
| (0.931) | (3.356) | |
| se_acad_mean | 0.307+ | 0.370* |
| (0.158) | (0.158) | |
| se_social_mean | 1.269*** | 2.859*** |
| (0.175) | (0.606) | |
| sup_friends_mean | 0.124 | 0.183+ |
| (0.097) | (0.099) | |
| sup_parents_mean | 0.726*** | 2.242*** |
| (0.099) | (0.562) | |
| se_social_mean × sup_parents_mean | -0.292** | |
| (0.107) | ||
| Num.Obs. | 283 | 283 |
| R2 | 0.489 | 0.503 |
| R2 Adj. | 0.482 | 0.494 |
| AIC | 1044.6 | 1039.0 |
| BIC | 1066.4 | 1064.5 |
| Log.Lik. | -516.288 | -512.513 |
| RMSE | 1.50 | 1.48 |
What does definitive mean here? Is it as simple as more significant coefficients? Marginally better performance measures?
g1 = modelplot(mod_sum_parcel, coef_omit = 'Intercept') + aes(color = ifelse(p.value < 0.05, "Significant at 0.05", "Not significant at 0.05")) +
scale_color_manual(values = c("grey", "blue")) + ggtitle("Significance of Coefficient Values", "At Different Levels for Sum and Mean Scores Life Satisfaction ")
g2 = modelplot(mod_sum_inter_parcel, coef_omit = 'Intercept') + aes(color = ifelse(p.value < 0.05, "Significant at 0.05", "Not significant at 0.05")) +
scale_color_manual(values = c("grey", "blue"))
plot <- ggarrange(g1,g2, ncol=2, nrow=1)

This kind of brinkmanship is brittle. Any one of these kinds of choices can be justified, but more often than not they result from a suspect exploratory process: steps down a “garden of forking paths” seeking some kind of story to justify an analysis or promote a conclusion. This post-hoc “seeking” is just bad science, undermining the significance claims that accrue to reliable procedures. It warps the nature of the testing procedure by corrupting the assumed consistency of repeatable trials. The guarantees of statistical significance attach to a conclusion just when the procedure is imagined replicable and repeated under identical conditions. By exploring the different representations and criteria of narrative adequacy we break those guarantees.
Exploratory and Confirmatory Modes
One of the things that psychometrics has pioneered well is the distinction between exploratory and confirmatory models. This distinction, when made explicit, partially guards against the abuses of inferential integrity we see in more common workflows. But additionally, models are often opaque - you may (as above) have improved some measure of model fit, or changed the parameter weighting accorded to an observed feature, but what does that mean? Exploration of model architectures, design choices and feature creation is just how we come to understand the meaning of a model specification. Even in the simple case of regression we’ve seen how adding an interaction term changes the interpretability of a model. How then are we to stand behind uncertainty estimates accorded to parameter weights when we barely intuit the implications of a model design?
Marginal Effects
The answer is not to rely on intuition, but to push forward and test the tangible implications of a fitted model. A model is a hypothesis, and one which should be subjected to stringent tests. We should subject the logical consequences of the design to appropriate scrutiny. We understand the implications and relative plausibility of any model more easily in terms of its predicted outcomes than in terms of subtle interaction effects expressed as parameter movements. As such, we should adopt this view in our evaluation of a model fit too.
Consider how we do this using Vincent Arel-Bundock’s wonderful marginaleffects package, passing a grid of new values through to our fitted model. These implications are a test of model plausibility too.
pred <- predictions(mod_sum_parcel, newdata = datagrid(sup_parents_mean = 1:10, se_social_mean = 1:10 ))
pred1 <- predictions(mod_sum_inter_parcel, newdata = datagrid(sup_parents_mean = 1:10, se_social_mean = 1:10))
pred1 |> tail(10) |> kableExtra::kable() |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")

|  | rowid | estimate | std.error | statistic | p.value | s.value | conf.low | conf.high | se_acad_mean | sup_friends_mean | sup_parents_mean | se_social_mean | ls_sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 91 | 91 | 19.27615 | 2.2881204 | 8.424449 | 0 | 54.61499 | 14.79152 | 23.76079 | 5.237686 | 5.929329 | 10 | 1 | 17.58333 |
| 92 | 92 | 19.21698 | 1.7840261 | 10.771691 | 0 | 87.46457 | 15.72035 | 22.71361 | 5.237686 | 5.929329 | 10 | 2 | 17.58333 |
| 93 | 93 | 19.15780 | 1.2888397 | 14.864380 | 0 | 163.60757 | 16.63172 | 21.68388 | 5.237686 | 5.929329 | 10 | 3 | 17.58333 |
| 94 | 94 | 19.09863 | 0.8188846 | 23.322730 | 0 | 397.24885 | 17.49364 | 20.70361 | 5.237686 | 5.929329 | 10 | 4 | 17.58333 |
| 95 | 95 | 19.03945 | 0.4595015 | 41.435010 | 0 | Inf | 18.13884 | 19.94006 | 5.237686 | 5.929329 | 10 | 5 | 17.58333 |
| 96 | 96 | 18.98027 | 0.5318049 | 35.690291 | 0 | 924.33454 | 17.93795 | 20.02259 | 5.237686 | 5.929329 | 10 | 6 | 17.58333 |
| 97 | 97 | 18.92110 | 0.9410614 | 20.106123 | 0 | 296.26806 | 17.07665 | 20.76554 | 5.237686 | 5.929329 | 10 | 7 | 17.58333 |
| 98 | 98 | 18.86192 | 1.4210849 | 13.272902 | 0 | 131.14398 | 16.07665 | 21.64720 | 5.237686 | 5.929329 | 10 | 8 | 17.58333 |
| 99 | 99 | 18.80274 | 1.9194980 | 9.795657 | 0 | 72.84938 | 15.04060 | 22.56489 | 5.237686 | 5.929329 | 10 | 9 | 17.58333 |
| 100 | 100 | 18.74357 | 2.4249884 | 7.729343 | 0 | 46.39459 | 13.99068 | 23.49646 | 5.237686 | 5.929329 | 10 | 10 | 17.58333 |
We see here how the impact of changes in se_social_mean is much reduced when sup_parents_mean is held fixed at high values (9, 10), once our model allows for an interaction effect.
pred |> head(10) |> kableExtra::kable() |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")

| rowid | estimate | std.error | statistic | p.value | s.value | conf.low | conf.high | se_acad_mean | sup_friends_mean | sup_parents_mean | se_social_mean | ls_sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7.065559 | 0.7860786 | 8.988362 | 0 | 61.78928 | 5.524873 | 8.606245 | 5.237686 | 5.929329 | 1 | 1 | 17.58333 |
| 2 | 8.334837 | 0.6546017 | 12.732685 | 0 | 120.95075 | 7.051842 | 9.617833 | 5.237686 | 5.929329 | 1 | 2 | 17.58333 |
| 3 | 9.604115 | 0.5480173 | 17.525206 | 0 | 226.01129 | 8.530021 | 10.678209 | 5.237686 | 5.929329 | 1 | 3 | 17.58333 |
| 4 | 10.873394 | 0.4830922 | 22.507905 | 0 | 370.25976 | 9.926550 | 11.820237 | 5.237686 | 5.929329 | 1 | 4 | 17.58333 |
| 5 | 12.142672 | 0.4771468 | 25.448505 | 0 | 472.16118 | 11.207481 | 13.077862 | 5.237686 | 5.929329 | 1 | 5 | 17.58333 |
| 6 | 13.411950 | 0.5321613 | 25.202790 | 0 | 463.16951 | 12.368933 | 14.454967 | 5.237686 | 5.929329 | 1 | 6 | 17.58333 |
| 7 | 14.681228 | 0.6324223 | 23.214278 | 0 | 393.60150 | 13.441703 | 15.920753 | 5.237686 | 5.929329 | 1 | 7 | 17.58333 |
| 8 | 15.950506 | 0.7602342 | 20.981043 | 0 | 322.26019 | 14.460474 | 17.440538 | 5.237686 | 5.929329 | 1 | 8 | 17.58333 |
| 9 | 17.219784 | 0.9039855 | 19.048739 | 0 | 266.32548 | 15.448005 | 18.991563 | 5.237686 | 5.929329 | 1 | 9 | 17.58333 |
| 10 | 18.489062 | 1.0571941 | 17.488806 | 0 | 225.08895 | 16.417000 | 20.561124 | 5.237686 | 5.929329 | 1 | 10 | 17.58333 |
This modifying effect is not as evident in the simpler model.
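To see why, note that in the interaction model the implied slope of ls_sum with respect to se_social_mean is itself a linear function of sup_parents_mean. A small sketch using the rounded coefficients reported above for model_sum_inter_score:

```python
# Coefficients rounded from the fitted interaction model's summary table.
b_social = 2.859   # se_social_mean
b_inter = -0.292   # se_social_mean x sup_parents_mean

def social_slope(sup_parents_mean):
    """Implied marginal effect of se_social_mean on ls_sum
    at a given level of parental support."""
    return b_social + b_inter * sup_parents_mean

print(round(social_slope(1), 3))   # strong positive effect at low parental support
print(round(social_slope(10), 3))  # effect roughly washed out at high support
```

At sup_parents_mean = 10 the implied slope is approximately zero, which is exactly the flattening visible in the predictions table above; in the simpler model the slope is constant regardless of parental support.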
Regression Marginal Effects
We can see this more clearly with a plot of the marginal effects.
g = plot_predictions(mod_sum_parcel, condition = c("se_social_mean", "sup_parents_mean"), type = "response") + ggtitle("Counterfactual Shift of Outcome: se_social_mean", "Holding all else Fixed: Simple Model")
g1 = plot_predictions(mod_sum_inter_parcel, condition = c("se_social_mean", "sup_parents_mean"), type = "response") + ggtitle("Counterfactual Shift of Outcome: se_social_mean", "Holding all else Fixed: Interaction Model")
plot <- ggarrange(g,g1, ncol=1, nrow=2)

Here we’ve pulled out some of the implications of each model in terms of the outcome predictions.
Models with Controls
The garden of forking paths presents itself within any set of covariates. How do we represent their effects? Which interactions are meaningful? How do we argue for one model design over another? The questionable paths multiply when we begin to consider additional covariates and group effects. But additional covariates help structure our expectations too. Yes, you can cut and chop your way through the garden to find some spurious correlation, but more plausibly you can bring in structurally important variables which helpfully moderate the outcomes based on our understanding of the data generating process.
Let’s assess the question again, this time allowing the model to account for differences in region.
formula_no_grp_sum = "ls_sum ~ sup_parents_p1 + sup_parents_p2 + sup_parents_p3 + sup_friends_p1 + sup_friends_p2 + sup_friends_p3 + se_acad_p1 + se_acad_p2 + se_acad_p3 +
se_social_p1 + se_social_p2 + se_social_p3"
formula_grp_sum = "ls_sum ~ sup_parents_p1 + sup_parents_p2 + sup_parents_p3 + sup_friends_p1 + sup_friends_p2 + sup_friends_p3 + se_acad_p1 + se_acad_p2 + se_acad_p3 +
se_social_p1 + se_social_p2 + se_social_p3 + factor(region)"
no_grp_sum_fit <- lm(formula_no_grp_sum , data = df)
grp_sum_fit <- lm(formula_grp_sum, data = df)
g1 = modelplot(no_grp_sum_fit, re.form=NA) + aes(color = ifelse(p.value < 0.05, "Significant at 0.05", "Not significant at 0.05")) +
  scale_color_manual(values = c("grey", "blue"), guide='none') + ggtitle("Significance of Coefficient Values", "No Group Effects Model")
g2 = modelplot(grp_sum_fit, re.form=NA) + aes(color = ifelse(p.value < 0.05, "Significant at 0.05", "Not significant at 0.05")) +
  scale_color_manual(values = c("grey", "blue")) + ggtitle("Significance of Coefficient Values", "Group Effects Model")
plot <- ggarrange(g1,g2, ncol=2, nrow=1)

We can see here how the significance reported for each coefficient is conditional on the model specification: the additional factor variable is not itself significant, and with it included the intercept is no longer as well identified.
modelsummary(list("No Group Effects Fit"= no_grp_sum_fit,
"Group Effects Fit"= grp_sum_fit),
stars = TRUE) |>
style_tt(
i = 2:25,
j = 1:1,
background = "#17C2AD",
color = "white",
italic = TRUE)

|  | No Group Effects Fit | Group Effects Fit |
|---|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||
| (Intercept) | 2.094* | 1.979+ |
| (0.954) | (1.041) | |
| sup_parents_p1 | 0.072 | 0.066 |
| (0.143) | (0.145) | |
| sup_parents_p2 | 0.118 | 0.122 |
| (0.144) | (0.145) | |
| sup_parents_p3 | 0.526*** | 0.529*** |
| (0.126) | (0.126) | |
| sup_friends_p1 | -0.272+ | -0.272+ |
| (0.150) | (0.150) | |
| sup_friends_p2 | 0.331* | 0.332* |
| (0.160) | (0.160) | |
| sup_friends_p3 | 0.165 | 0.166 |
| (0.130) | (0.131) | |
| se_acad_p1 | -0.208 | -0.212 |
| (0.192) | (0.193) | |
| se_acad_p2 | 0.327 | 0.328 |
| (0.202) | (0.202) | |
| se_acad_p3 | 0.153 | 0.169 |
| (0.174) | (0.184) | |
| se_social_p1 | 0.355+ | 0.345+ |
| (0.200) | (0.204) | |
| se_social_p2 | 0.509* | 0.517* |
| (0.219) | (0.221) | |
| se_social_p3 | 0.443** | 0.446** |
| (0.161) | (0.161) | |
| factor(region)west | 0.056 | |
| (0.202) | ||
| Num.Obs. | 283 | 283 |
| R2 | 0.517 | 0.517 |
| R2 Adj. | 0.496 | 0.494 |
| AIC | 1044.7 | 1046.6 |
| BIC | 1095.7 | 1101.3 |
| Log.Lik. | -508.341 | -508.300 |
| RMSE | 1.46 | 1.46 |
Again, the cleanest way to interpret the implications of these specifications is to derive the conditional marginal effects and assess them for plausibility.
Group Marginal Effects
We see here the different levels expected for different regional responses for changes in each of these input variables.
g = plot_predictions(grp_sum_fit, condition = c("sup_parents_p3", "region"), type = "response") + ggtitle("Counterfactual Shift of Outcome: sup_parents_p3", "Holding all else Fixed")
g1 = plot_predictions(grp_sum_fit, condition = c("sup_friends_p1", "region"), type = "response") + ggtitle("Counterfactual Shift of Outcome: sup_friends_p1", "Holding all else Fixed")
g2 = plot_predictions(grp_sum_fit, condition = c("se_acad_p1", "region"),
type = "response") + ggtitle("Counterfactual Shift of Outcome: se_acad_p1", "Holding all else Fixed")
plot <- ggarrange(g,g1,g2, ncol=1, nrow=3)

Model Design and Conditional Exchangeability
Modelling is nearly always about contrasts and questions of meaningful differences. What we seek to do when we build a model is to find a structure that enables “fair” or “justified” inferences about these contrasts. This is maybe most palpably brought home when we consider cases of causal inference. Here we want to define a meaningful causal estimand - some contrast between a treatment group and a control group where we can be confident that the two groups are suitably “representative”, so that the observed difference represents the effect of the treatment. We are saying that, conditional on our model design, we consider the potential outcomes to be exchangeable. Another way of putting it is that we have controlled for all aspects of systematic difference between the control and treatment groups, and this warrants the causal interpretation of the contrast between treatment and control.
But as we’ve seen above, the cleanest way to interpret almost any regression model is to understand the model design in terms of marginal effects on the outcome scale. These are just contrasts. All statistical models are, in some sense, focused on finding a structure that licenses an inference about some contrast of interest between the levels of an observed variable and the implied outcomes. We therefore want to include enough structure in our model to induce conditional exchangeability between the units of study across such group contrasts. This notion of conditional exchangeability is inherently epistemic. We believe that our model is apt to induce conditional exchangeability in the sense that the people in our survey have no systematic differences which bias the measured contrasts. The implied difference in some marginal effect (while holding all else fixed) is then a “fair” representation of the same counterfactual adjustment in the population or sub-population defined by the covariate profile. The model-implied difference is a proxy for the result of possible future interventions and as such merits our attention in policy design.
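The contrast in question can be stated explicitly. For a focal input \(X\) and a covariate profile \(\mathbf{Z} = \mathbf{z}\) held fixed, the marginal effect we report is the difference in expected outcomes

\[
\Delta(x, x') = \mathbb{E}[Y \mid X = x', \mathbf{Z} = \mathbf{z}] - \mathbb{E}[Y \mid X = x, \mathbf{Z} = \mathbf{z}]
\]

and it is exchangeability, conditional on \(\mathbf{z}\), that licenses reading this difference as the effect of moving \(X\) from \(x\) to \(x'\).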
In particular, the view here is that we ought to be deliberate in how we structure our models to license the plausibility of the exchangeability claim. By De Finetti’s theorem, the distribution of an exchangeable sequence of variables can be expressed as a mixture of conditionally independent variables. So if we specify the conditional distribution correctly, we recover the conditions that warrant inference with a well designed model. The mixture is taken over the vector of parameters \(\boldsymbol{\beta}\) upon which we condition our model. Mislevy and Levy highlight that this observation has implications for model development by quoting Judea Pearl:
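Concretely, De Finetti’s representation for an exchangeable sequence \(y_1, \ldots, y_n\) is

\[
p(y_1, \ldots, y_n) = \int \prod_{i=1}^{n} p(y_i \mid \boldsymbol{\beta}) \, p(\boldsymbol{\beta}) \, d\boldsymbol{\beta}
\]

so that, conditional on the parameter vector \(\boldsymbol{\beta}\), the observations are independent and identically distributed.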
[C]onditional independence is not a grace of nature for which we must wait passively, but rather a psychological necessity which we satisfy by organising our knowledge in a specific way. An important tool in such an organisation is the identification of intermediate variables that induce conditional independence among observables; if such variables are not in our vocabulary, we create them. In medical diagnosis, for instance, when some symptoms directly influence one another, the medical profession invents a name for that interaction (e.g. “syndrome”, “complication”, “pathological state”) and treats it as a new auxiliary variable that induces conditional independence.” - Pearl quoted in Bayesian Psychometric Modeling p61
The organisation of our data into meaningful categories or clusters is common in machine learning; similarly, in psychometrics we see the prevalence of factor models, which are more explicitly causal models in which we posit some unifying theme behind a cluster of measures. In each tradition we are seeking conditional independence by allowing the model to account for these dependencies in how we structure it. These choices represent a fruitful aspect of walking down the garden of forking paths. There is then a tension between a confirmatory seeking of some pre-determined story and an exploratory seeking of clarity. With this in mind we now turn to types of modelling architecture that open up new expressive possibilities and offer a way to think about the relationships between observed data and the latent factors that drive the observed outcomes. This greater expressive power offers different routes to inducing patterns of conditional exchangeability, and different risks too.
Confirmatory Factor Analysis
We will illustrate the details of confirmatory factor modelling using the lavaan framework. We’ll focus mostly on the mechanics of how these models are estimated using maximum likelihood techniques before illustrating the salient differences of Bayesian estimation.
First recall that the idea of confirmatory factor analysis is that there are some latent constructs which determine our data generating process. In our survey we’ve already clustered our questions by themes so it makes sense to extend this idea to posit latent constructs mapped to each of these themes. In the jargon of structural equation models this is called the measurement model.
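In matrix form the measurement model for respondent \(i\) is

\[
\mathbf{x}_i = \boldsymbol{\tau} + \boldsymbol{\Lambda}\boldsymbol{\xi}_i + \boldsymbol{\epsilon}_i
\]

where \(\mathbf{x}_i\) stacks the observed survey responses, \(\boldsymbol{\tau}\) holds the item intercepts, \(\boldsymbol{\Lambda}\) the factor loadings, \(\boldsymbol{\xi}_i\) the latent constructs and \(\boldsymbol{\epsilon}_i\) the measurement error.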
model_measurement <- "
# Measurement model
SUP_Parents =~ sup_parents_p1 + sup_parents_p2 + sup_parents_p3
SUP_Friends =~ sup_friends_p1 + sup_friends_p2 + sup_friends_p3
SE_Academic =~ se_acad_p1 + se_acad_p2 + se_acad_p3
SE_Social =~ se_social_p1 + se_social_p2 + se_social_p3
LS =~ ls_p1 + ls_p2 + ls_p3
"
model_measurement1 <- "
# Measurement model
SUP_Parents =~ b1*sup_parents_p1 + b2*sup_parents_p2 + b3*sup_parents_p3
SUP_Friends =~ a1*sup_friends_p1 + a2*sup_friends_p2 + a3*sup_friends_p3
SE_Academic =~ c1*se_acad_p1 + c2*se_acad_p2 + c3*se_acad_p3
SE_Social =~ se_social_p1 + se_social_p2 + se_social_p3
LS =~ ls_p1 + ls_p2 + ls_p3
a1 == a2
a1 == a3
b1 == b2
b1 == b3
c1 == c2
c1 == c3
"
fit_mod <- cfa(model_measurement, data = df)
fit_mod_1 <- cfa(model_measurement1, data = df)
In the above syntax we have specified two slightly different measurement models. In each case the questions of our survey are mapped to an appropriate latent factor e.g. LS =~ ls_p1 + ls_p2 + ls_p3. The “=~” syntax denotes a “measured by” relationship, where the goal is to estimate how each of the observed measurements loads on the latent factor. In the first model we have allowed each of the factor loadings to be estimated freely, but in the second we have forced equal weights within the SUP_Friends, SUP_Parents and SE_Academic constructs. A benefit of this framework is that we do not have to resort to crude aggregations like sum-scores or mean-scores over the outcome variables; we can let them vary freely and have the model estimate the multivariate relationships between the observed variables and these latent constructs.
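To make the “crude aggregation” concrete, here is a small Python sketch (toy data, hypothetical values) of the sum-score and mean-score approach the factor model lets us avoid; each item is implicitly treated as an equally weighted, error-free measure of the construct:

```python
import pandas as pd

# Toy responses to the three life-satisfaction items (hypothetical values)
toy = pd.DataFrame({'ls_p1': [4, 6, 5],
                    'ls_p2': [5, 6, 4],
                    'ls_p3': [4, 7, 5]})

# Sum-score and mean-score collapse the items with fixed, equal weights,
# unlike the freely estimated loadings in the measurement model
toy['ls_sum'] = toy[['ls_p1', 'ls_p2', 'ls_p3']].sum(axis=1)
toy['ls_mean'] = toy[['ls_p1', 'ls_p2', 'ls_p3']].mean(axis=1)
print(toy)
```

The factor model instead weights each item by its estimated loading and separates out measurement error.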
If we plot the estimated parameters as before we’ll see some additional parameters reported.
cfa_models = list("full_measurement_model" = fit_mod,
"measurement_model_reduced" = fit_mod_1)
modelplot(cfa_models)
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_segment()`).
Here there are two distinct types of parameters: (i) the factor loadings accorded (=~) to the individual observed metrics and (ii) the covariances (~~) between the latent constructs. We can further report the extent of the model fit summaries.
summary(fit_mod, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-18 ended normally after 49 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 40
Number of observations 283
Model Test User Model:
Test statistic 223.992
Degrees of freedom 80
P-value (Chi-square) 0.000
Model Test Baseline Model:
Test statistic 2696.489
Degrees of freedom 105
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.944
Tucker-Lewis Index (TLI) 0.927
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -4285.972
Loglikelihood unrestricted model (H1) -4173.976
Akaike (AIC) 8651.944
Bayesian (BIC) 8797.761
Sample-size adjusted Bayesian (SABIC) 8670.921
Root Mean Square Error of Approximation:
RMSEA 0.080
90 Percent confidence interval - lower 0.067
90 Percent confidence interval - upper 0.092
P-value H_0: RMSEA <= 0.050 0.000
P-value H_0: RMSEA >= 0.080 0.500
Standardized Root Mean Square Residual:
SRMR 0.056
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
SUP_Parents =~
sup_parents_p1 1.000 0.935 0.873
sup_parents_p2 1.036 0.056 18.613 0.000 0.969 0.887
sup_parents_p3 0.996 0.059 16.754 0.000 0.932 0.816
SUP_Friends =~
sup_friends_p1 1.000 1.021 0.906
sup_friends_p2 0.792 0.043 18.463 0.000 0.809 0.857
sup_friends_p3 0.891 0.050 17.741 0.000 0.910 0.831
SE_Academic =~
se_acad_p1 1.000 0.695 0.878
se_acad_p2 0.809 0.050 16.290 0.000 0.562 0.820
se_acad_p3 0.955 0.058 16.500 0.000 0.664 0.829
SE_Social =~
se_social_p1 1.000 0.638 0.843
se_social_p2 0.967 0.056 17.248 0.000 0.617 0.885
se_social_p3 0.928 0.067 13.880 0.000 0.592 0.741
LS =~
ls_p1 1.000 0.667 0.718
ls_p2 0.778 0.074 10.498 0.000 0.519 0.712
ls_p3 0.968 0.090 10.730 0.000 0.645 0.732
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
SUP_Parents ~~
SUP_Friends 0.132 0.064 2.073 0.038 0.138 0.138
SE_Academic 0.218 0.046 4.727 0.000 0.336 0.336
SE_Social 0.282 0.045 6.224 0.000 0.472 0.472
LS 0.405 0.057 7.132 0.000 0.650 0.650
SUP_Friends ~~
SE_Academic 0.071 0.047 1.493 0.136 0.100 0.100
SE_Social 0.196 0.046 4.281 0.000 0.301 0.301
LS 0.175 0.051 3.445 0.001 0.257 0.257
SE_Academic ~~
SE_Social 0.271 0.036 7.493 0.000 0.611 0.611
LS 0.238 0.039 6.065 0.000 0.514 0.514
SE_Social ~~
LS 0.321 0.042 7.659 0.000 0.755 0.755
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.sup_parents_p1 0.273 0.037 7.358 0.000 0.273 0.238
.sup_parents_p2 0.255 0.038 6.738 0.000 0.255 0.213
.sup_parents_p3 0.437 0.048 9.201 0.000 0.437 0.335
.sup_friends_p1 0.227 0.040 5.656 0.000 0.227 0.179
.sup_friends_p2 0.238 0.030 7.936 0.000 0.238 0.266
.sup_friends_p3 0.371 0.042 8.809 0.000 0.371 0.310
.se_acad_p1 0.144 0.022 6.593 0.000 0.144 0.229
.se_acad_p2 0.153 0.018 8.621 0.000 0.153 0.327
.se_acad_p3 0.200 0.024 8.372 0.000 0.200 0.313
.se_social_p1 0.166 0.020 8.134 0.000 0.166 0.290
.se_social_p2 0.106 0.016 6.542 0.000 0.106 0.217
.se_social_p3 0.288 0.028 10.132 0.000 0.288 0.451
.ls_p1 0.417 0.045 9.233 0.000 0.417 0.484
.ls_p2 0.261 0.028 9.321 0.000 0.261 0.492
.ls_p3 0.362 0.040 9.005 0.000 0.362 0.465
SUP_Parents 0.875 0.098 8.910 0.000 1.000 1.000
SUP_Friends 1.042 0.111 9.407 0.000 1.000 1.000
SE_Academic 0.483 0.054 8.880 0.000 1.000 1.000
SE_Social 0.407 0.048 8.403 0.000 1.000 1.000
LS 0.444 0.069 6.394 0.000 1.000 1.000
Note how, in addition to the individual parameter estimates, the summaries report various measures of global model fit. These fit statistics are important for evaluating alternative ways of parameterising our models. The number of parameters is a real concern in maximum likelihood approaches to estimating these models: with too many parameters we can easily overfit the particular sample. This stems in part from the limitations of the optimisation goal in the traditional CFA framework - we optimise the model parameters to reproduce the observed covariance matrix. Once we have more parameters than there are unique entries in the covariance matrix, the model is free to overfit considerably. This can then be checked as a measure of local model fit, which may highlight infelicities or suspicious discrepancies between the observed covariances and those implied by the learned model.
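We can check this bookkeeping directly. With 15 indicators the sample covariance matrix has 120 unique entries, and the lavaan summary above reports 40 free parameters, which reproduces the 80 degrees of freedom in the model test:

```python
# Degrees of freedom available to the CFA: unique covariance entries
# minus free parameters (both figures from the lavaan summary above)
p = 15                      # observed indicators across the five constructs
moments = p * (p + 1) // 2  # unique variances and covariances: 120
params = 40                 # number of model parameters reported by lavaan
df_model = moments - params
print(moments, df_model)    # → 120 80
```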
g1 = plot_heatmap(cov(df[, drivers]))
g2 = plot_heatmap(data.frame(fitted(fit_mod)$cov)[drivers, drivers], title="Model Implied Covariances", "Fitted Values")
resids = cov(df[, drivers]) - data.frame(fitted(fit_mod)$cov)[drivers, drivers]
g3 = plot_heatmap(resids, title="Residuals", "Fitted Values versus Observed Sample Covariance")
plot <- ggarrange(g1,g2,g3, ncol=1, nrow=3);
Summary Global Fit Measures
We can also compare models based on their global measures of model fit giving some indication of whether parameter specifications improve or reduce fidelity with the true data generating process.
summary_df = cbind(fitMeasures(fit_mod, c("chisq", "baseline.chisq", "cfi", "aic", "bic", "rmsea","srmr")),
fitMeasures(fit_mod_1, c("chisq", "baseline.chisq", "cfi", "aic", "bic", "rmsea","srmr")))
colnames(summary_df) = c('Full Model', 'Reduced Model')
summary_df |> kable() |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")
| | Full Model | Reduced Model |
|---|---|---|
| chisq | 223.9922306 | 256.0287010 |
| baseline.chisq | 2696.4887420 | 2696.4887420 |
| cfi | 0.9444365 | 0.9343896 |
| aic | 8651.9435210 | 8671.9799914 |
| bic | 8797.7613969 | 8795.9251859 |
| rmsea | 0.0797501 | 0.0835831 |
| srmr | 0.0558656 | 0.0625710 |
There is a wealth of metrics associated with CFA model fit and it can be hard to see the forest for the trees.
Visualising the Relationships
One of the better ways to visualise these models is to use the semPlot package. Here we can plot all the parameter estimates in one graph. Following convention the rectangular boxes represent observed measures. Oval or circular objects represent the latent constructs. The self-directed arrows on each node is the variance of that measure. The two-way arrows between nodes represents the covariance between those two nodes. The single headed arrows from the latent construct to the indicator variables denotes the factor loading of the variable on the construct.
semPlot::semPaths(fit_mod, whatLabels = 'est', intercepts = FALSE, layout = "spring")
For instance, in this plot you can see that for each latent construct one of the indicator variables has its factor loading set to 1. This is a mathematical requirement, which we’ll see below, used to ensure identifiability of the parameters - akin to setting a reference category in categorical regression. Additionally you can “read off” the covariances between our constructs e.g. the covariance between LS and SUP_P is 0.36, the largest value amongst the set of covariances.
Comparing Models
We can use a variety of chi-squared tests to evaluate the goodness of fit for our models. If we pass in each model individually we perform a test comparing our model to the saturated model. The Chi-Squared test compares the model-implied variance-covariance matrix (expected) to the variance-covariance matrix computed from the actual data (observed). The null hypothesis for the Chi-Squared Goodness-of-Fit test is that the model fits the data perfectly, meaning that there is no significant difference between the observed and model-implied variance-covariance matrices. The goal is to see if the differences between these matrices are large enough that we can reject the null.
lavTestLRT(fit_mod)
Chi-Squared Test Statistic (unscaled)
Df AIC BIC Chisq Chisq diff Df diff Pr(>Chisq)
Saturated 0 0.00
Model 80 8651.9 8797.8 223.99 223.99 80 0.000000000000001443 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Passing in a single model, we can reject the null hypothesis that the saturated model’s (perfect fit) variance-covariance matrix and the candidate model’s implied matrix are drawn from the same distribution. Comparing our two model fits, we likewise reject the hypothesis that the two models are drawn from the same distribution.
lavTestLRT(fit_mod, fit_mod_1)
Chi-Squared Difference Test
Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
fit_mod 80 8651.9 8797.8 223.99
fit_mod_1 86 8672.0 8795.9 256.03 32.036 0.12383 6 0.00001606 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The test also reports a number of other fit indices and the degrees of freedom available to each model. These are important because the Chi-Squared test is overly sensitive in large samples, and model adequacy is a multi-dimensional question.
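As a sanity check, the reported p-value of the chi-squared difference test can be reproduced by hand from the two test statistics; for even degrees of freedom the chi-squared survival function has a closed form (the statistics below are copied from the lavaan output above):

```python
import math

def chi2_sf_even(x, df):
    """Survival function of a chi-squared variable with even df."""
    k = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j) for j in range(k))

# Test statistics and degrees of freedom from the two lavaan fits
chisq_diff = 256.0287 - 223.9922   # difference in chi-squared statistics
df_diff = 86 - 80                  # difference in degrees of freedom
p_value = chi2_sf_even(chisq_diff, df_diff)
print(round(p_value, 8))           # close to the reported 0.00001606
```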
Structural Equation Models
So far so good. We have a confirmatory factor measurement model. We’ve structured it so that we can make inferences about the correlations and covariances between 5 latent constructs of independent interest. We’ve calibrated the model fit statistics by ensuring the model can reasonably recover the observed variance-covariance structure. But what about the dependency relations between constructs? We can evaluate these too! Adding regressions to our model allows us to express these relationships and then recover summary statistics of the same.
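In equations, the structural component below posits a system of regressions among the latent variables, with residual terms \(\zeta_j\):

\[
\begin{aligned}
\text{SE\_Academic} &= \beta_{1}\,\text{SUP\_Parents} + \beta_{2}\,\text{SUP\_Friends} + \zeta_{1} \\
\text{SE\_Social} &= \beta_{3}\,\text{SUP\_Parents} + \beta_{4}\,\text{SUP\_Friends} + \zeta_{2} \\
\text{LS} &= \beta_{5}\,\text{SE\_Academic} + \beta_{6}\,\text{SE\_Social} + \beta_{7}\,\text{SUP\_Parents} + \beta_{8}\,\text{SUP\_Friends} + \zeta_{3}
\end{aligned}
\]

where \(\zeta_1\) and \(\zeta_2\) are additionally allowed to covary.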
model <- "
# Measurement model
SUP_Parents =~ sup_parents_p1 + sup_parents_p2 + sup_parents_p3
SUP_Friends =~ sup_friends_p1 + sup_friends_p2 + sup_friends_p3
SE_Academic =~ se_acad_p1 + se_acad_p2 + se_acad_p3
SE_Social =~ se_social_p1 + se_social_p2 + se_social_p3
LS =~ ls_p1 + ls_p2 + ls_p3
# Structural model
# Regressions
SE_Academic ~ SUP_Parents + SUP_Friends
SE_Social ~ SUP_Parents + SUP_Friends
LS ~ SE_Academic + SE_Social + SUP_Parents + SUP_Friends
# Residual covariances
SE_Academic ~~ SE_Social
"
fit_mod_sem <- sem(model, data = df)
modelplot(fit_mod_sem)
semPlot::semPaths(fit_mod_sem, whatLabels = 'est', intercepts = FALSE)
Compare this structure against the simpler measurement model from before and we observe a puzzling phenomenon: the models report identical measures of fit.
summary_df = cbind(fitMeasures(fit_mod, c("chisq", "baseline.chisq", "cfi", "aic", "bic", "rmsea","srmr")),
fitMeasures(fit_mod_sem, c("chisq", "baseline.chisq", "cfi", "aic", "bic", "rmsea","srmr")),
fitMeasures(fit_mod_1, c("chisq", "baseline.chisq", "cfi", "aic", "bic", "rmsea","srmr")))
colnames(summary_df) = c('Full Model', 'SEM Model', 'Reduced Model')
summary_df |> kable() |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")
| | Full Model | SEM Model | Reduced Model |
|---|---|---|---|
| chisq | 223.9922306 | 223.9922306 | 256.0287010 |
| baseline.chisq | 2696.4887420 | 2696.4887420 | 2696.4887420 |
| cfi | 0.9444365 | 0.9444365 | 0.9343896 |
| aic | 8651.9435210 | 8651.9435210 | 8671.9799914 |
| bic | 8797.7613969 | 8797.7613969 | 8795.9251859 |
| rmsea | 0.0797501 | 0.0797501 | 0.0835831 |
| srmr | 0.0558656 | 0.0558655 | 0.0625710 |
The models have the same degrees of freedom, which suggests that in some sense we have already saturated the model fit: the structural regressions reparameterise the latent covariances rather than adding further testable structure.
lavTestLRT(fit_mod_sem, fit_mod)
Warning: lavaan->lavTestLRT():
some models have the same degrees of freedom
Chi-Squared Difference Test
Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
fit_mod_sem 80 8651.9 8797.8 223.99
fit_mod 80 8651.9 8797.8 223.99 0.0000000043218 0 0
lavTestLRT(fit_mod_sem, fit_mod_1)
Chi-Squared Difference Test
Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
fit_mod_sem 80 8651.9 8797.8 223.99
fit_mod_1 86 8672.0 8795.9 256.03 32.036 0.12383 6 0.00001606 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see this similarly in the plotted residuals which are identical across the models despite meaningful structural differences.
heat_df = data.frame(resid(fit_mod, type = "standardized")$cov)
heat_df = heat_df |> as.matrix() |> melt()
colnames(heat_df) <- c("x", "y", "value")
g1 = heat_df |> mutate(value = round(value, 2)) |> ggplot(aes(x = x, y = y, fill = value)) +
geom_tile() + geom_text(aes(label = value), color = "black", size = 4) +
scale_fill_gradient2(
high = 'dodgerblue4',
mid = 'white',
low = 'firebrick2'
) + theme(axis.text.x = element_text(angle=45)) + ggtitle("Standardised Residuals of the Sample Covariances and Model Implied Covariances", "A Visual Check of SEM Model fit")
heat_df = data.frame(resid(fit_mod_sem, type = "standardized")$cov)
heat_df = heat_df |> as.matrix() |> melt()
colnames(heat_df) <- c("x", "y", "value")
g2 = heat_df |> mutate(value = round(value, 2)) |> ggplot(aes(x = x, y = y, fill = value)) +
geom_tile() + geom_text(aes(label = value), color = "black", size = 4) +
scale_fill_gradient2(
high = 'dodgerblue4',
mid = 'white',
low = 'firebrick2'
) + theme(axis.text.x = element_text(angle=45)) + ggtitle("Standardised Residuals of the Sample Covariances and Model Implied Covariances", "A Visual Check of Measurement Model fit")
plot <- ggarrange(g1,g2, ncol=1, nrow=2);
This is a genuine limitation in the expressive power of SEM models when they are fit using maximum likelihood with finite degrees of freedom, optimising for fidelity to the sample variance-covariance matrix.
Confirmatory Factor Models with PyMC
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pytensor import tensor as pt
import arviz as az
import networkx as nx
np.random.seed(150)
df_p = pd.read_csv('IIS.dat', sep='\s+')
df_p.head()
     PI    AD   IGC   FI   FC
0 4.00 3.38 4.67 2.6 3.2
1 2.57 3.00 3.50 2.4 2.8
2 2.29 3.29 4.83 2.0 3.4
3 2.43 3.63 4.33 3.6 3.8
4 3.00 4.00 4.83 3.4 3.8
coords = {'obs': list(range(len(df_p))),
'indicators': ['PI', 'AD', 'IGC', 'FI', 'FC'],
'indicators_1': ['PI', 'AD', 'IGC'],
'indicators_2': ['FI', 'FC'],
'latent': ['Student', 'Faculty']
}
obs_idx = list(range(len(df_p)))
with pm.Model(coords=coords) as model:
Psi = pm.InverseGamma('Psi', 5, 10, dims='indicators')
lambdas_ = pm.Normal('lambdas_1', 1, 10, dims=('indicators_1'))
lambdas_1 = pm.Deterministic('lambdas1', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_1'))
lambdas_ = pm.Normal('lambdas_2', 1, 10, dims=('indicators_2'))
lambdas_2 = pm.Deterministic('lambdas2', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_2'))
tau = pm.Normal('tau', 3, 10, dims='indicators')
kappa = 0
sd_dist = pm.Exponential.dist(1.0, shape=2)
chol, _, _ = pm.LKJCholeskyCov('chol_cov', n=2, eta=2,
sd_dist=sd_dist, compute_corr=True)
ksi = pm.MvNormal('ksi', kappa, chol=chol, dims=('obs', 'latent'))
m1 = tau[0] + ksi[obs_idx, 0]*lambdas_1[0]
m2 = tau[1] + ksi[obs_idx, 0]*lambdas_1[1]
m3 = tau[2] + ksi[obs_idx, 0]*lambdas_1[2]
m4 = tau[3] + ksi[obs_idx, 1]*lambdas_2[0]
m5 = tau[4] + ksi[obs_idx, 1]*lambdas_2[1]
mu = pm.Deterministic('mu', pm.math.stack([m1, m2, m3, m4, m5]).T)
_ = pm.Normal('likelihood', mu, Psi, observed=df_p.values)
idata = pm.sample(nuts_sampler='numpyro', target_accept=.95,
idata_kwargs={"log_likelihood": True})
idata.extend(pm.sample_posterior_predictive(idata))
summary_df = az.summary(idata, var_names=['lambdas1', 'lambdas2', 'tau', 'Psi', 'ksi', 'chol_cov_corr'], coords= {'obs': [0, 7]})
py$summary_df |> kable() |> kable_classic(full_width = F, html_font = "Cambria")
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| lambdas1[PI] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
| lambdas1[AD] | 0.898 | 0.061 | 0.788 | 1.017 | 0.003 | 0.002 | 340 | 594 | 1.01 |
| lambdas1[IGC] | 0.535 | 0.044 | 0.452 | 0.616 | 0.002 | 0.001 | 473 | 859 | 1.01 |
| lambdas2[FI] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
| lambdas2[FC] | 0.978 | 0.058 | 0.871 | 1.084 | 0.003 | 0.002 | 333 | 828 | 1.00 |
| tau[PI] | 3.333 | 0.037 | 3.265 | 3.403 | 0.001 | 0.001 | 781 | 1727 | 1.01 |
| tau[AD] | 3.898 | 0.027 | 3.849 | 3.949 | 0.001 | 0.001 | 557 | 1012 | 1.01 |
| tau[IGC] | 4.596 | 0.021 | 4.555 | 4.634 | 0.001 | 0.000 | 919 | 1669 | 1.01 |
| tau[FI] | 3.034 | 0.039 | 2.961 | 3.110 | 0.002 | 0.001 | 667 | 1200 | 1.01 |
| tau[FC] | 3.713 | 0.034 | 3.647 | 3.774 | 0.001 | 0.001 | 552 | 1137 | 1.01 |
| Psi[PI] | 0.609 | 0.024 | 0.563 | 0.653 | 0.001 | 0.000 | 1309 | 2315 | 1.00 |
| Psi[AD] | 0.317 | 0.020 | 0.280 | 0.356 | 0.001 | 0.001 | 630 | 1168 | 1.00 |
| Psi[IGC] | 0.355 | 0.014 | 0.329 | 0.380 | 0.000 | 0.000 | 2422 | 2635 | 1.00 |
| Psi[FI] | 0.569 | 0.026 | 0.519 | 0.615 | 0.001 | 0.001 | 933 | 1900 | 1.00 |
| Psi[FC] | 0.421 | 0.026 | 0.371 | 0.469 | 0.001 | 0.001 | 577 | 1350 | 1.00 |
| ksi[0, Student] | -0.226 | 0.222 | -0.618 | 0.195 | 0.003 | 0.003 | 4292 | 3262 | 1.00 |
| ksi[0, Faculty] | -0.369 | 0.268 | -0.878 | 0.120 | 0.004 | 0.003 | 4332 | 2753 | 1.00 |
| ksi[7, Student] | 0.889 | 0.229 | 0.471 | 1.319 | 0.004 | 0.003 | 3487 | 2770 | 1.00 |
| ksi[7, Faculty] | 0.871 | 0.278 | 0.336 | 1.376 | 0.004 | 0.003 | 4253 | 2880 | 1.00 |
| chol_cov_corr[0, 0] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
| chol_cov_corr[0, 1] | 0.850 | 0.028 | 0.796 | 0.899 | 0.001 | 0.001 | 479 | 795 | 1.00 |
| chol_cov_corr[1, 0] | 0.850 | 0.028 | 0.796 | 0.899 | 0.001 | 0.001 | 479 | 795 | 1.00 |
| chol_cov_corr[1, 1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 3772 | 3881 | 1.00 |
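The chol_cov_corr entries above are derived quantities: LKJCholeskyCov samples a Cholesky factor of the latent covariance, and the correlation matrix follows by standardising it. A minimal NumPy sketch with made-up numbers:

```python
import numpy as np

# Hypothetical 2x2 lower-triangular Cholesky factor, standing in for a
# single posterior draw of chol_cov from the PyMC model
L = np.array([[1.00, 0.00],
              [0.85, 0.53]])

cov = L @ L.T                      # implied covariance of the latent factors
sd = np.sqrt(np.diag(cov))
corr = cov / np.outer(sd, sd)      # unit diagonal by construction
print(np.round(corr, 3))
```

This is why the diagonal entries of chol_cov_corr in the table are exactly 1 with zero standard deviation.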
az.plot_trace(idata, var_names=['lambdas1', 'lambdas2', 'tau', 'Psi', 'ksi']);
/Users/nathanielforde/mambaforge/envs/pymc_causal/lib/python3.11/site-packages/arviz/stats/density_utils.py:487: UserWarning: Your data appears to have a single value or no finite values
warnings.warn("Your data appears to have a single value or no finite values")
/Users/nathanielforde/mambaforge/envs/pymc_causal/lib/python3.11/site-packages/arviz/stats/density_utils.py:487: UserWarning: Your data appears to have a single value or no finite values
warnings.warn("Your data appears to have a single value or no finite values")
df = pd.read_csv('sem_data.csv')
drivers = ['se_acad_p1', 'se_acad_p2',
'se_acad_p3', 'se_social_p1', 'se_social_p2', 'se_social_p3',
'sup_friends_p1', 'sup_friends_p2', 'sup_friends_p3', 'sup_parents_p1',
'sup_parents_p2', 'sup_parents_p3', 'ls_p1', 'ls_p2', 'ls_p3']
coords = {'obs': list(range(len(df))),
'indicators': drivers,
'indicators_1': ['se_acad_p1','se_acad_p2','se_acad_p3'],
'indicators_2': ['se_social_p1','se_social_p2','se_social_p3'],
'indicators_3': ['sup_friends_p1','sup_friends_p2','sup_friends_p3'],
'indicators_4': [ 'sup_parents_p1','sup_parents_p2','sup_parents_p3'],
'indicators_5': ['ls_p1','ls_p2', 'ls_p3'],
'latent': ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P', 'LS'],
'latent1': ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P', 'LS']
}
obs_idx = list(range(len(df)))
with pm.Model(coords=coords) as model:
Psi = pm.InverseGamma('Psi', 5, 10, dims='indicators')
lambdas_ = pm.Normal('lambdas_1', 1, 10, dims=('indicators_1'))
lambdas_1 = pm.Deterministic('lambdas1', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_1'))
lambdas_ = pm.Normal('lambdas_2', 1, 10, dims=('indicators_2'))
lambdas_2 = pm.Deterministic('lambdas2', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_2'))
lambdas_ = pm.Normal('lambdas_3', 1, 10, dims=('indicators_3'))
lambdas_3 = pm.Deterministic('lambdas3', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_3'))
lambdas_ = pm.Normal('lambdas_4', 1, 10, dims=('indicators_4'))
lambdas_4 = pm.Deterministic('lambdas4', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_4'))
lambdas_ = pm.Normal('lambdas_5', 1, 10, dims=('indicators_5'))
lambdas_5 = pm.Deterministic('lambdas5', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_5'))
tau = pm.Normal('tau', 3, 10, dims='indicators')
kappa = 0
sd_dist = pm.Exponential.dist(1.0, shape=5)
chol, _, _ = pm.LKJCholeskyCov('chol_cov', n=5, eta=2,
sd_dist=sd_dist, compute_corr=True)
cov = pm.Deterministic("cov", chol.dot(chol.T), dims=('latent', 'latent1'))
ksi = pm.MvNormal('ksi', kappa, chol=chol, dims=('obs', 'latent'))
m0 = tau[0] + ksi[obs_idx, 0]*lambdas_1[0]
m1 = tau[1] + ksi[obs_idx, 0]*lambdas_1[1]
m2 = tau[2] + ksi[obs_idx, 0]*lambdas_1[2]
m3 = tau[3] + ksi[obs_idx, 1]*lambdas_2[0]
m4 = tau[4] + ksi[obs_idx, 1]*lambdas_2[1]
m5 = tau[5] + ksi[obs_idx, 1]*lambdas_2[2]
m6 = tau[6] + ksi[obs_idx, 2]*lambdas_3[0]
m7 = tau[7] + ksi[obs_idx, 2]*lambdas_3[1]
m8 = tau[8] + ksi[obs_idx, 2]*lambdas_3[2]
m9 = tau[9] + ksi[obs_idx, 3]*lambdas_4[0]
m10 = tau[10] + ksi[obs_idx, 3]*lambdas_4[1]
m11 = tau[11] + ksi[obs_idx, 3]*lambdas_4[2]
m12 = tau[12] + ksi[obs_idx, 4]*lambdas_5[0]
m13 = tau[13] + ksi[obs_idx, 4]*lambdas_5[1]
m14 = tau[14] + ksi[obs_idx, 4]*lambdas_5[2]
mu = pm.Deterministic('mu', pm.math.stack([m0, m1, m2, m3, m4, m5, m6, m7,
m8, m9, m10, m11, m12, m13, m14]).T)
_ = pm.Normal('likelihood', mu, Psi, observed=df[drivers].values)
idata = pm.sample(nuts_sampler='numpyro', target_accept=.95, tune=1000,
idata_kwargs={"log_likelihood": True}, random_seed=100)
idata.extend(pm.sample_posterior_predictive(idata))
summary_df1 = az.summary(idata, var_names=['lambdas1', 'lambdas2', 'lambdas3', 'lambdas4', 'lambdas5', 'tau', 'Psi'])
cov_df = pd.DataFrame(az.extract(idata['posterior'])['cov'].mean(axis=2))
cov_df.index = ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P', 'LS']
cov_df.columns = ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P', 'LS']
correlation_df = pd.DataFrame(az.extract(idata['posterior'])['chol_cov_corr'].mean(axis=2))
correlation_df.index = ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P', 'LS']
correlation_df.columns = ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P', 'LS']
py$summary_df1 |> kable() |> kable_classic(full_width = F, html_font = "Cambria")
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| lambdas1[se_acad_p1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
| lambdas1[se_acad_p2] | 0.817 | 0.052 | 0.720 | 0.915 | 0.001 | 0.001 | 1193 | 2008 | 1.00 |
| lambdas1[se_acad_p3] | 0.967 | 0.060 | 0.854 | 1.076 | 0.002 | 0.001 | 1286 | 2037 | 1.00 |
| lambdas2[se_social_p1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
| lambdas2[se_social_p2] | 0.965 | 0.058 | 0.856 | 1.071 | 0.002 | 0.002 | 757 | 1634 | 1.00 |
| lambdas2[se_social_p3] | 0.941 | 0.072 | 0.805 | 1.076 | 0.002 | 0.002 | 878 | 1580 | 1.00 |
| lambdas3[sup_friends_p1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
| lambdas3[sup_friends_p2] | 0.802 | 0.044 | 0.720 | 0.887 | 0.001 | 0.001 | 1045 | 1701 | 1.00 |
| lambdas3[sup_friends_p3] | 0.905 | 0.053 | 0.805 | 1.006 | 0.002 | 0.001 | 1235 | 2150 | 1.00 |
| lambdas4[sup_parents_p1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
| lambdas4[sup_parents_p2] | 1.040 | 0.059 | 0.931 | 1.150 | 0.002 | 0.002 | 758 | 1383 | 1.00 |
| lambdas4[sup_parents_p3] | 1.010 | 0.064 | 0.898 | 1.137 | 0.002 | 0.001 | 1051 | 1840 | 1.00 |
| lambdas5[ls_p1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
| lambdas5[ls_p2] | 0.791 | 0.085 | 0.627 | 0.944 | 0.004 | 0.003 | 541 | 1074 | 1.00 |
| lambdas5[ls_p3] | 0.990 | 0.103 | 0.806 | 1.187 | 0.004 | 0.003 | 543 | 878 | 1.00 |
| tau[se_acad_p1] | 5.153 | 0.044 | 5.069 | 5.234 | 0.002 | 0.001 | 488 | 1151 | 1.01 |
| tau[se_acad_p2] | 5.345 | 0.039 | 5.271 | 5.414 | 0.002 | 0.001 | 528 | 1033 | 1.01 |
| tau[se_acad_p3] | 5.209 | 0.045 | 5.127 | 5.297 | 0.002 | 0.001 | 526 | 1290 | 1.01 |
| tau[se_social_p1] | 5.286 | 0.042 | 5.208 | 5.366 | 0.002 | 0.002 | 380 | 743 | 1.01 |
| tau[se_social_p2] | 5.473 | 0.039 | 5.397 | 5.544 | 0.002 | 0.001 | 363 | 742 | 1.01 |
| tau[se_social_p3] | 5.437 | 0.045 | 5.351 | 5.522 | 0.002 | 0.001 | 492 | 982 | 1.00 |
| tau[sup_friends_p1] | 5.782 | 0.068 | 5.651 | 5.904 | 0.004 | 0.003 | 333 | 763 | 1.01 |
| tau[sup_friends_p2] | 6.007 | 0.057 | 5.909 | 6.125 | 0.003 | 0.002 | 397 | 872 | 1.00 |
| tau[sup_friends_p3] | 5.987 | 0.066 | 5.864 | 6.115 | 0.003 | 0.002 | 385 | 890 | 1.01 |
| tau[sup_parents_p1] | 5.973 | 0.061 | 5.858 | 6.085 | 0.003 | 0.002 | 427 | 1059 | 1.00 |
| tau[sup_parents_p2] | 5.925 | 0.062 | 5.807 | 6.040 | 0.003 | 0.002 | 394 | 924 | 1.01 |
| tau[sup_parents_p3] | 5.716 | 0.066 | 5.596 | 5.840 | 0.003 | 0.002 | 470 | 1294 | 1.00 |
| tau[ls_p1] | 5.188 | 0.053 | 5.092 | 5.289 | 0.002 | 0.001 | 654 | 1378 | 1.00 |
| tau[ls_p2] | 5.775 | 0.041 | 5.693 | 5.849 | 0.002 | 0.001 | 716 | 1596 | 1.00 |
| tau[ls_p3] | 5.219 | 0.051 | 5.121 | 5.314 | 0.002 | 0.001 | 666 | 1609 | 1.00 |
| Psi[se_acad_p1] | 0.412 | 0.028 | 0.359 | 0.465 | 0.001 | 0.001 | 1278 | 1740 | 1.00 |
| Psi[se_acad_p2] | 0.413 | 0.024 | 0.367 | 0.456 | 0.001 | 0.000 | 2170 | 2268 | 1.00 |
| Psi[se_acad_p3] | 0.468 | 0.027 | 0.418 | 0.519 | 0.001 | 0.000 | 1844 | 2408 | 1.00 |
| Psi[se_social_p1] | 0.431 | 0.026 | 0.381 | 0.477 | 0.001 | 0.000 | 1382 | 2219 | 1.00 |
| Psi[se_social_p2] | 0.361 | 0.025 | 0.314 | 0.405 | 0.001 | 0.000 | 1486 | 2135 | 1.00 |
| Psi[se_social_p3] | 0.553 | 0.029 | 0.500 | 0.606 | 0.001 | 0.000 | 2594 | 2803 | 1.00 |
| Psi[sup_friends_p1] | 0.517 | 0.040 | 0.439 | 0.587 | 0.001 | 0.001 | 866 | 1739 | 1.00 |
| Psi[sup_friends_p2] | 0.508 | 0.031 | 0.454 | 0.568 | 0.001 | 0.001 | 1420 | 1985 | 1.00 |
| Psi[sup_friends_p3] | 0.625 | 0.036 | 0.562 | 0.694 | 0.001 | 0.001 | 2090 | 2329 | 1.00 |
| Psi[sup_parents_p1] | 0.550 | 0.035 | 0.485 | 0.615 | 0.001 | 0.001 | 1530 | 2075 | 1.00 |
| Psi[sup_parents_p2] | 0.536 | 0.038 | 0.465 | 0.605 | 0.001 | 0.001 | 1192 | 2078 | 1.01 |
| Psi[sup_parents_p3] | 0.675 | 0.038 | 0.602 | 0.745 | 0.001 | 0.001 | 2089 | 2371 | 1.00 |
| Psi[ls_p1] | 0.671 | 0.038 | 0.603 | 0.744 | 0.001 | 0.001 | 1045 | 2387 | 1.00 |
| Psi[ls_p2] | 0.534 | 0.030 | 0.477 | 0.591 | 0.001 | 0.001 | 1409 | 2472 | 1.00 |
| Psi[ls_p3] | 0.622 | 0.035 | 0.554 | 0.688 | 0.001 | 0.001 | 1597 | 2393 | 1.00 |
fig, ax = plt.subplots(figsize=(15, 8))
az.plot_forest(idata, var_names=['lambdas1', 'lambdas2', 'lambdas3', 'lambdas4', 'lambdas5'], combined=True, ax=ax);
ax.set_title("Factor Loadings for each of the Five Factors");
py$cov_df |> kable(caption= "Covariances Amongst Latent Factors", digits=2) |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria") |> row_spec(5, color = "red")
| | SE_ACAD | SE_SOCIAL | SUP_F | SUP_P | LS |
|---|---|---|---|---|---|
| SE_ACAD | 0.47 | 0.26 | 0.06 | 0.20 | 0.22 |
| SE_SOCIAL | 0.26 | 0.39 | 0.19 | 0.26 | 0.30 |
| SUP_F | 0.06 | 0.19 | 1.03 | 0.12 | 0.16 |
| SUP_P | 0.20 | 0.26 | 0.12 | 0.86 | 0.38 |
| LS | 0.22 | 0.30 | 0.16 | 0.38 | 0.42 |
py$correlation_df |> kable(caption= "Correlations Amongst Latent Factors", digits=2) |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria") |> row_spec(5, color = "red")
| | SE_ACAD | SE_SOCIAL | SUP_F | SUP_P | LS |
|---|---|---|---|---|---|
| SE_ACAD | 1.00 | 0.60 | 0.09 | 0.32 | 0.50 |
| SE_SOCIAL | 0.60 | 1.00 | 0.29 | 0.45 | 0.75 |
| SUP_F | 0.09 | 0.29 | 1.00 | 0.12 | 0.25 |
| SUP_P | 0.32 | 0.45 | 0.12 | 1.00 | 0.64 |
| LS | 0.50 | 0.75 | 0.25 | 0.64 | 1.00 |
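As a quick sanity check, the correlation table above can be recovered from the covariance table by scaling each entry with the latent factors' standard deviations. The sketch below transcribes the rounded values from the covariance table, so the recovered correlations agree with the reported ones only up to rounding:

```python
import numpy as np

# Covariance matrix among latent factors, transcribed from the table above
labels = ["SE_ACAD", "SE_SOCIAL", "SUP_F", "SUP_P", "LS"]
cov = np.array([
    [0.47, 0.26, 0.06, 0.20, 0.22],
    [0.26, 0.39, 0.19, 0.26, 0.30],
    [0.06, 0.19, 1.03, 0.12, 0.16],
    [0.20, 0.26, 0.12, 0.86, 0.38],
    [0.22, 0.30, 0.16, 0.38, 0.42],
])

# corr(i, j) = cov(i, j) / (sd(i) * sd(j))
sd = np.sqrt(np.diag(cov))
corr = cov / np.outer(sd, sd)
print(np.round(corr, 2))
```

The off-diagonal entries land within a couple of hundredths of the correlation table, e.g. the SE_ACAD/SE_SOCIAL entry comes out near 0.60 and the LS/SUP_P entry near 0.64; small discrepancies are an artifact of working from two-decimal rounded covariances.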
def make_ppc(idata):
    # Overlay 100 posterior predictive draws against the observed scores for each indicator
    fig, axs = plt.subplots(5, 3, figsize=(20, 20))
    axs = axs.flatten()
    for i in range(15):
        # Plot the observed distribution once per panel, not once per draw
        axs[i].hist(df[drivers[i]], alpha=0.3, ec='black', bins=20, label='Observed Scores')
        for j in range(100):
            temp = az.extract(idata['posterior_predictive'].sel({'likelihood_dim_3': i}))['likelihood'].values[:, j]
            label = 'Predicted Scores' if j == 0 else None
            axs[i].hist(temp, color='purple', alpha=0.1, bins=20, label=label)
        axs[i].set_title(f"Posterior Predictive Checks {drivers[i]}")
        axs[i].legend()
    plt.show()
make_ppc(idata)
def get_posterior_resids(idata, samples=100, metric='cov'):
    # Average the residuals between the observed and model-implied
    # covariance (or correlation) matrices over posterior predictive draws
    resids = []
    if metric == 'cov':
        obs = df[drivers].cov()
    else:
        obs = df[drivers].corr()
    for i in range(samples):
        draw = pd.DataFrame(az.extract(idata['posterior_predictive'])['likelihood'][:, :, i])
        model_mat = draw.cov() if metric == 'cov' else draw.corr()
        model_mat.index = obs.index
        model_mat.columns = obs.columns
        resids.append((model_mat - obs).values.flatten())
    residuals_posterior = pd.DataFrame(pd.DataFrame(resids).mean().values.reshape(15, 15))
    residuals_posterior.index = obs.index
    residuals_posterior.columns = obs.columns
    return residuals_posterior
residuals_posterior_cov = get_posterior_resids(idata, 500)
residuals_posterior_corr = get_posterior_resids(idata, 500, metric='corr')
plot_heatmap(py$residuals_posterior_cov, "Residuals of the Sample Covariances and Model Implied Covariances", "A Visual Check of Bayesian Model fit")
plot_heatmap(py$residuals_posterior_corr, "Residuals of the Sample Correlations and Model Implied Correlations", "A Visual Check of Bayesian Model fit")
Structural Equation Modelling in PyMC
def make_sem(priors):
coords = {'obs': list(range(len(df))),
'indicators': drivers,
'indicators_1': ['se_acad_p1','se_acad_p2','se_acad_p3'],
'indicators_2': ['se_social_p1','se_social_p2','se_social_p3'],
'indicators_3': ['sup_friends_p1','sup_friends_p2','sup_friends_p3'],
'indicators_4': [ 'sup_parents_p1','sup_parents_p2','sup_parents_p3'],
'indicators_5': ['ls_p1','ls_p2', 'ls_p3'],
'latent': ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P'],
'latent1': ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P']
}
obs_idx = list(range(len(df)))
with pm.Model(coords=coords) as model:
Psi = pm.InverseGamma('Psi', 5, 10, dims='indicators')
lambdas_ = pm.Normal('lambdas_1', priors['lambda'][0], priors['lambda'][1], dims=('indicators_1'))
lambdas_1 = pm.Deterministic('lambdas1', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_1'))
lambdas_ = pm.Normal('lambdas_2', priors['lambda'][0], priors['lambda'][1], dims=('indicators_2'))
lambdas_2 = pm.Deterministic('lambdas2', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_2'))
lambdas_ = pm.Normal('lambdas_3', priors['lambda'][0], priors['lambda'][1], dims=('indicators_3'))
lambdas_3 = pm.Deterministic('lambdas3', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_3'))
lambdas_ = pm.Normal('lambdas_4', priors['lambda'][0], priors['lambda'][1], dims=('indicators_4'))
lambdas_4 = pm.Deterministic('lambdas4', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_4'))
lambdas_ = pm.Normal('lambdas_5', priors['lambda'][0], priors['lambda'][1], dims=('indicators_5'))
lambdas_5 = pm.Deterministic('lambdas5', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_5'))
tau = pm.Normal('tau', 3, 10, dims='indicators')
kappa = 0
sd_dist = pm.Exponential.dist(1.0, shape=4)
chol, _, _ = pm.LKJCholeskyCov('chol_cov', n=4, eta=priors['eta'],
sd_dist=sd_dist, compute_corr=True)
cov = pm.Deterministic("cov", chol.dot(chol.T), dims=('latent', 'latent1'))
ksi = pm.MvNormal('ksi', kappa, chol=chol, dims=('obs', 'latent'))
## Additional Regression Component
beta_r = pm.Normal('beta_r', 0, 1, dims='latent')
regression = pm.Deterministic('regr', beta_r[0]*ksi[obs_idx, 0] + beta_r[1]*ksi[obs_idx, 1] +
beta_r[2]*ksi[obs_idx, 2] + beta_r[3]*ksi[obs_idx, 3])
m0 = tau[0] + ksi[obs_idx, 0]*lambdas_1[0]
m1 = tau[1] + ksi[obs_idx, 0]*lambdas_1[1]
m2 = tau[2] + ksi[obs_idx, 0]*lambdas_1[2]
m3 = tau[3] + ksi[obs_idx, 1]*lambdas_2[0]
m4 = tau[4] + ksi[obs_idx, 1]*lambdas_2[1]
m5 = tau[5] + ksi[obs_idx, 1]*lambdas_2[2]
m6 = tau[6] + ksi[obs_idx, 2]*lambdas_3[0]
m7 = tau[7] + ksi[obs_idx, 2]*lambdas_3[1]
m8 = tau[8] + ksi[obs_idx, 2]*lambdas_3[2]
m9 = tau[9] + ksi[obs_idx, 3]*lambdas_4[0]
m10 = tau[10] + ksi[obs_idx, 3]*lambdas_4[1]
m11 = tau[11] + ksi[obs_idx, 3]*lambdas_4[2]
m12 = tau[12] + regression*lambdas_5[0]
m13 = tau[13] + regression*lambdas_5[1]
m14 = tau[14] + regression*lambdas_5[2]
mu = pm.Deterministic('mu', pm.math.stack([m0, m1, m2, m3, m4, m5, m6, m7,
m8, m9, m10, m11, m12, m13, m14]).T)
_ = pm.Normal('likelihood', mu, Psi, observed=df[drivers].values)
idata = pm.sample(chains=6, nuts_sampler='numpyro', target_accept=.95, tune=1000,
idata_kwargs={"log_likelihood": True}, random_seed=100)
idata.extend(pm.sample_posterior_predictive(idata))
return model, idata
model1, idata1 = make_sem(priors={'eta': 2, 'lambda': [1, 2]})
az.plot_posterior(idata1, var_names=['beta_r']);
make_ppc(idata1)
SEM and Indirect Effects
def make_indirect_sem(priors):
coords = {'obs': list(range(len(df))),
'indicators': drivers,
'indicators_1': ['se_acad_p1','se_acad_p2','se_acad_p3'],
'indicators_2': ['se_social_p1','se_social_p2','se_social_p3'],
'indicators_3': ['sup_friends_p1','sup_friends_p2','sup_friends_p3'],
'indicators_4': [ 'sup_parents_p1','sup_parents_p2','sup_parents_p3'],
'indicators_5': ['ls_p1','ls_p2', 'ls_p3'],
'latent': ['SUP_F', 'SUP_P'],
'latent1': ['SUP_F', 'SUP_P'],
'latent_regression': ['SUP_F->SE_ACAD', 'SUP_P->SE_ACAD', 'SUP_F->SE_SOC', 'SUP_P->SE_SOC'],
'regression': ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P']
}
obs_idx = list(range(len(df)))
with pm.Model(coords=coords) as model:
Psi = pm.InverseGamma('Psi', 5, 10, dims='indicators')
lambdas_ = pm.Normal('lambdas_1', priors['lambda'][0], priors['lambda'][1], dims=('indicators_1'))
lambdas_1 = pm.Deterministic('lambdas1', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_1'))
lambdas_ = pm.Normal('lambdas_2', priors['lambda'][0], priors['lambda'][1], dims=('indicators_2'))
lambdas_2 = pm.Deterministic('lambdas2', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_2'))
lambdas_ = pm.Normal('lambdas_3', priors['lambda'][0], priors['lambda'][1], dims=('indicators_3'))
lambdas_3 = pm.Deterministic('lambdas3', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_3'))
lambdas_ = pm.Normal('lambdas_4', priors['lambda'][0], priors['lambda'][1], dims=('indicators_4'))
lambdas_4 = pm.Deterministic('lambdas4', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_4'))
lambdas_ = pm.Normal('lambdas_5', priors['lambda'][0], priors['lambda'][1], dims=('indicators_5'))
lambdas_5 = pm.Deterministic('lambdas5', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_5'))
tau = pm.Normal('tau', 3, 10, dims='indicators')
kappa = 0
sd_dist = pm.Exponential.dist(1.0, shape=2)
chol, _, _ = pm.LKJCholeskyCov('chol_cov', n=2, eta=priors['eta'],
sd_dist=sd_dist, compute_corr=True)
cov = pm.Deterministic("cov", chol.dot(chol.T), dims=('latent', 'latent1'))
ksi = pm.MvNormal('ksi', kappa, chol=chol, dims=('obs', 'latent'))
beta_r = pm.Normal('beta_r', 0, 0.5, dims='latent_regression')
beta_r2 = pm.Normal('beta_r2', 0, 1, dims='regression')
# SE_ACAD ~ SUP_FRIENDS + SUP_PARENTS
regression_se_acad = pm.Deterministic('regr_se_acad', beta_r[0]*ksi[obs_idx, 0] + beta_r[1]*ksi[obs_idx, 1])
# SE_SOCIAL ~ SUP_FRIENDS + SUP_PARENTS
regression_se_social = pm.Deterministic('regr_se_social', beta_r[2]*ksi[obs_idx, 0] + beta_r[3]*ksi[obs_idx, 1])
# LS ~ SE_ACAD + SE_SOCIAL + SUP_FRIEND + SUP_PARENTS
regression = pm.Deterministic('regr', beta_r2[0]*regression_se_acad + beta_r2[1]*regression_se_social +
beta_r2[2]*ksi[obs_idx, 0] + beta_r2[3]*ksi[obs_idx, 1])
m0 = tau[0] + regression_se_acad*lambdas_1[0]
m1 = tau[1] + regression_se_acad*lambdas_1[1]
m2 = tau[2] + regression_se_acad*lambdas_1[2]
m3 = tau[3] + regression_se_social*lambdas_2[0]
m4 = tau[4] + regression_se_social*lambdas_2[1]
m5 = tau[5] + regression_se_social*lambdas_2[2]
m6 = tau[6] + ksi[obs_idx, 0]*lambdas_3[0]
m7 = tau[7] + ksi[obs_idx, 0]*lambdas_3[1]
m8 = tau[8] + ksi[obs_idx, 0]*lambdas_3[2]
m9 = tau[9] + ksi[obs_idx, 1]*lambdas_4[0]
m10 = tau[10] + ksi[obs_idx, 1]*lambdas_4[1]
m11 = tau[11] + ksi[obs_idx, 1]*lambdas_4[2]
m12 = tau[12] + regression*lambdas_5[0]
m13 = tau[13] + regression*lambdas_5[1]
m14 = tau[14] + regression*lambdas_5[2]
mu = pm.Deterministic('mu', pm.math.stack([m0, m1, m2, m3, m4, m5, m6, m7,
m8, m9, m10, m11, m12, m13, m14]).T)
_ = pm.Normal('likelihood', mu, Psi, observed=df[drivers].values)
idata = pm.sample(10_000, chains=4, nuts_sampler='numpyro', target_accept=.99, tune=2000,
idata_kwargs={"log_likelihood": True}, random_seed=110)
idata.extend(pm.sample_posterior_predictive(idata))
return model, idata
model2, idata2 = make_indirect_sem(priors={'eta': 2, 'lambda': [1, 1]})
make_ppc(idata2)
summary_df = az.summary(idata2, var_names=['beta_r', 'beta_r2'])
def calculate_effects(summary_df, var='SUP_P'):
    # Indirect Paths
    ## VAR -> SE_SOC -> LS
    indirect_parent_soc = summary_df.loc[f'beta_r[{var}->SE_SOC]']['mean']*summary_df.loc['beta_r2[SE_SOCIAL]']['mean']
    ## VAR -> SE_ACAD -> LS
    indirect_parent_acad = summary_df.loc[f'beta_r[{var}->SE_ACAD]']['mean']*summary_df.loc['beta_r2[SE_ACAD]']['mean']
    ## Total Indirect Effects
    total_indirect = indirect_parent_soc + indirect_parent_acad
    ## Total Effect = direct effect + total indirect effect
    total_effect = total_indirect + summary_df.loc[f'beta_r2[{var}]']['mean']
    return pd.DataFrame([[indirect_parent_soc, indirect_parent_acad, total_indirect, total_effect]],
                        columns=[f'{var} -> SE_SOC -> LS', f'{var} -> SE_ACAD -> LS',
                                 f'Total Indirect Effects {var}', f'Total Effects {var}'])
indirect_p = calculate_effects(summary_df, 'SUP_P')
indirect_f = calculate_effects(summary_df, 'SUP_F')
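The arithmetic behind these effect decompositions is easy to verify by hand. Using the posterior means reported in the regression-coefficients table (transcribed here at two decimals, so the results match the effects table only up to rounding), the parental-support pathways work out as:

```python
# Posterior means transcribed from the regression-coefficients table
b_sup_p_to_se_soc = 0.86   # beta_r[SUP_P->SE_SOC]
b_sup_p_to_se_acad = 0.78  # beta_r[SUP_P->SE_ACAD]
b_se_social_to_ls = 0.32   # beta_r2[SE_SOCIAL]
b_se_acad_to_ls = 0.28     # beta_r2[SE_ACAD]
b_sup_p_to_ls = 0.36       # beta_r2[SUP_P], the direct effect

# Each indirect effect is the product of the coefficients along its path
indirect_soc = b_sup_p_to_se_soc * b_se_social_to_ls    # SUP_P -> SE_SOC -> LS
indirect_acad = b_sup_p_to_se_acad * b_se_acad_to_ls    # SUP_P -> SE_ACAD -> LS
total_indirect = indirect_soc + indirect_acad
total_effect = total_indirect + b_sup_p_to_ls
print(round(indirect_soc, 2), round(indirect_acad, 2),
      round(total_indirect, 2), round(total_effect, 2))
```

These products recover the roughly 0.28 and 0.22 indirect paths, an approximately 0.50 total indirect effect, and a total effect near 0.86 for parental support.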
residuals_posterior_cov = get_posterior_resids(idata2, 500)
py$summary_df |> kable(caption= "Regression Coefficients Amongst Latent Factors", digits=2) |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta_r[SUP_F->SE_ACAD] | -0.01 | 0.04 | -0.08 | 0.07 | 0.00 | 0 | 6737 | 14576 | 1 |
| beta_r[SUP_P->SE_ACAD] | 0.78 | 0.10 | 0.60 | 0.97 | 0.00 | 0 | 2845 | 5689 | 1 |
| beta_r[SUP_F->SE_SOC] | 0.12 | 0.04 | 0.05 | 0.19 | 0.00 | 0 | 6051 | 12607 | 1 |
| beta_r[SUP_P->SE_SOC] | 0.86 | 0.10 | 0.68 | 1.04 | 0.00 | 0 | 2356 | 4448 | 1 |
| beta_r2[SE_ACAD] | 0.28 | 0.85 | -1.32 | 1.90 | 0.00 | 0 | 31374 | 29782 | 1 |
| beta_r2[SE_SOCIAL] | 0.32 | 0.82 | -1.23 | 1.85 | 0.01 | 0 | 21163 | 25065 | 1 |
| beta_r2[SUP_F] | 0.05 | 0.11 | -0.16 | 0.28 | 0.00 | 0 | 19210 | 21990 | 1 |
| beta_r2[SUP_P] | 0.36 | 0.76 | -1.05 | 1.79 | 0.00 | 0 | 27927 | 28806 | 1 |
py$indirect_p |> kable(caption= "Total and Indirect Effects: Parental Support", digits=2) |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")
| SUP_P -> SE_SOC -> LS | SUP_P -> SE_ACAD -> LS | Total Indirect Effects SUP_P | Total Effects SUP_P |
|---|---|---|---|
| 0.28 | 0.22 | 0.5 | 0.86 |
py$indirect_f |> kable(caption= "Total and Indirect Effects: Friend Support", digits=4) |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")
| SUP_F -> SE_SOC -> LS | SUP_F -> SE_ACAD -> LS | Total Indirect Effects SUP_F | Total Effects SUP_F |
|---|---|---|---|
| 0.0396 | -0.0017 | 0.038 | 0.091 |
plot_heatmap(py$residuals_posterior_cov, "Residuals of the Sample Covariances and Model Implied Covariances", "A Visual Check of Bayesian Model fit")
Citation
@online{forde,
author = {Nathaniel Forde},
title = {Measurement, {Latent} {Factors} and the {Garden} of {Forking}
{Paths}},
langid = {en}
}